Add EPYC CPU serving skill (vLLM + zentorch) #76
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Change-Id: I1dc2362e0983326658b6618015a161ecd44f40e6
@Mahdi-CV Can you help review this?

hi @danielholanda @Mahdi-CV @shailensobhee, can we move ahead with review and CI for this PR? thanks!

Hi @amd-lalithnc, a few things so far. Having benchmarked ZenDNN 6.0 recently, I noticed this versioning issue, which you may want to clarify in the SKILL itself: Do you agree with this observation? If yes, we may need to clarify this in the SKILL file and associated documentation.
Conclusion: We can merge this skill, but there are potential performance aspects to narrow down. Thoughts? cc: @Vkathail
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com> Change-Id: I6442cc19df3caa3e0e5f36cc276bf94550d5a95e

hi @shailensobhee, thanks for your thoughts!
let me know if the changes are suitable. thanks!

shailensobhee left a comment
Approving. All three points I raised earlier have been addressed in the code, verified against the head commit:
- vLLM version clarity - `data/epyc.json` pins `vllm_version: 0.22.0` with the matching public container tag (`amdih/zendnn_zentorch:vllm_v0.22.0_zentorch_v2.11.0.1_...`). Pinning the public stable 0.22.0 is correct while 0.23.0 is still in validation. The 0.23 / zentorch 2.11 `TORCHINDUCTOR_FREEZING` crash gotcha is documented.
- Dual-socket selection - `cpu_tune.py` now samples per-socket load from `/proc/stat`, prefers a free socket, falls back to the least-busy one with a warning when both are busy, and supports `--socket N` to force.
- KV-cache locality - memory is now bound to the chosen socket (`numactl --cpunodebind/--membind` for conda, `--cpuset-mems` for containers) and KV cache is sized from that socket's local RAM, not whole-system RAM. NPS2/NPS4 multi-node cases emit a note.
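
For readers who want to picture the dual-socket selection described above, here is a minimal sketch, not the actual `cpu_tune.py` code. It assumes two `/proc/stat` snapshots taken a short interval apart and a socket-to-CPU map supplied by the caller (on a real host this could be read from `/sys/devices/system/cpu/cpu*/topology/physical_package_id`). The function names and the 10% "busy" threshold are hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def parse_cpu_times(stat_text: str) -> Dict[int, Tuple[int, int]]:
    """Return {cpu_id: (busy_jiffies, total_jiffies)} from /proc/stat text."""
    times: Dict[int, Tuple[int, int]] = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-CPU lines look like "cpu12 user nice system idle iowait ...";
        # the aggregate "cpu" line and non-CPU lines are skipped.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        cpu_id = int(fields[0][3:])
        # Only the first 8 counters; guest time is already folded into user.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
        total = sum(values)
        times[cpu_id] = (total - idle, total)
    return times


def socket_load(
    before: Dict[int, Tuple[int, int]],
    after: Dict[int, Tuple[int, int]],
    sockets: Dict[int, List[int]],
) -> Dict[int, float]:
    """Busy fraction per socket between two snapshots."""
    load: Dict[int, float] = {}
    for socket_id, cpus in sockets.items():
        busy = sum(after[c][0] - before[c][0] for c in cpus)
        total = sum(after[c][1] - before[c][1] for c in cpus)
        load[socket_id] = busy / total if total else 0.0
    return load


def pick_socket(
    load: Dict[int, float],
    busy_threshold: float = 0.10,  # hypothetical cut-off for "busy"
    forced: Optional[int] = None,
) -> Tuple[int, Optional[str]]:
    """Prefer a free socket; otherwise take the least-busy one with a warning."""
    if forced is not None:  # mirrors the idea of `--socket N`
        return forced, None
    best = min(load, key=lambda s: (load[s], s))
    warning = None
    if load[best] >= busy_threshold:
        warning = f"all sockets busy; using least-busy socket {best}"
    return best, warning
```

Keeping the parsing separate from the decision makes the selection rule easy to test with canned snapshots instead of a live host.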
Note on CI: the behavioral checks are red due to a CI infra issue, not the skill. The eval harness fails at setup in conftest.py because the runner's claude judge CLI is not authenticated (Not logged in / Please run /login), so zero behavioral assertions actually executed. This is expected for a fork PR where Actions secrets are withheld. All substantive checks pass: skill validation, manifest validation, SkillSpector security scan, and external-reference checks. Recommend a maintainer with CI-secret access re-run the behavioral job (or run it from an in-repo branch) to get a clean green before merge.
What
Adds `serving-llms-on-epyc`: a skill that brings up a single vLLM OpenAI endpoint on an AMD EPYC CPU host with the zentorch backend, in a container (Docker/Podman) or a conda env.
Flow
- `detect.py`.
- `validate.py`: container runtime (docker/podman) or conda fallback; image present, and if already pulled, `import vllm, zentorch` inside it; host perf libraries (tcmalloc / OpenMP via `LD_PRELOAD`); `HF_TOKEN`; RAM.
- `check_model.py`: confirm vLLM supports the architecture via its model registry (text or multimodal); reject pooling / non-LLM (not chat endpoints). Gated models require `HF_TOKEN` + license acceptance.
- `estimate_memory.py`: weights + KV cache + headroom ≤ host RAM.
- `cpu_tune.py`: bind to socket 0's physical cores and set `VLLM_CPU_KVCACHE_SPACE`; no memory binding by default (NPS2/NPS4 get a perf note).
- `vllm serve` (never `--device cpu` on vLLM ≥ 0.20).
- `/health`, validate the `/v1/chat/completions` endpoint, then print a connection table.
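
The memory check in the flow (weights + KV cache + headroom ≤ host RAM) can be illustrated with a short sketch. This is not the code in `estimate_memory.py`; the class and function names, the parameter-count-times-bytes approximation for weights, and the 10% headroom fraction are all assumptions made for the example.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class MemoryEstimate:
    weights_gib: float
    kv_cache_gib: float
    headroom_gib: float
    host_ram_gib: float

    @property
    def required_gib(self) -> float:
        # weights + KV cache + headroom, as in the flow step
        return self.weights_gib + self.kv_cache_gib + self.headroom_gib

    @property
    def fits(self) -> bool:
        return self.required_gib <= self.host_ram_gib


def estimate_memory(
    num_params: int,
    bytes_per_param: float,          # e.g. 2 for bf16 weights
    kv_cache_gib: float,             # the value intended for VLLM_CPU_KVCACHE_SPACE
    host_ram_gib: float,
    headroom_fraction: float = 0.10,  # hypothetical safety margin
) -> MemoryEstimate:
    """Rough fit check: approximate weights as params * bytes per param."""
    weights_gib = num_params * bytes_per_param / GIB
    headroom_gib = host_ram_gib * headroom_fraction
    return MemoryEstimate(weights_gib, kv_cache_gib, headroom_gib, host_ram_gib)
```

Under these assumptions, an 8B-parameter bf16 model with a 40 GiB KV cache fits on a 128 GiB host, while a 70B-parameter bf16 model does not, so `fits` would be the signal to stop before launch.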
Single instance. On any failure it reports the cause + logs and stops, no retry, no debugging loop.
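
As a rough illustration of the readiness step and the stop-on-failure behavior, the sketch below polls a health probe until it returns HTTP 200 or a deadline passes, and then simply reports the result instead of relaunching. The helper names, the timeout, and the polling interval are hypothetical; only the `/health` path comes from the description above.

```python
import time
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status for a GET; raises OSError subclasses on failure."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status


def wait_until_healthy(
    probe: Callable[[], int],
    timeout_s: float = 600.0,   # hypothetical startup budget
    interval_s: float = 2.0,    # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns 200 or the deadline passes."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if probe() == 200:
                return True
        except OSError:
            # Connection refused or an HTTP error: the server is not ready yet.
            pass
        sleep(interval_s)
    return False
```

A caller might use `wait_until_healthy(lambda: http_status("http://localhost:8000/health"))`, where the host and port are placeholders, and on `False` print the cause and the server logs, then exit without retrying the launch.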
Contents
- `SKILL.md`, `reference.md`, `skill-card.md`, `data/epyc.json`
- `detect.py`, `validate.py`, `check_model.py`, `estimate_memory.py`, `cpu_tune.py`
- `eval/behavioral/tests/test_serving_llms_on_epyc.py`
- `.claude-plugin/marketplace.json` (+ regenerated Cursor manifest)

Notes / scope
- `amdih/zendnn_zentorch` image on Docker Hub.
- `TORCHINDUCTOR_FREEZING=1` requires `VLLM_USE_AOT_COMPILE=0`.
- `OMP_NUM_THREADS` and `VLLM_CPU_NUM_OF_RESERVED_CPU` are intentionally left unset; vLLM derives them (from the bind list / its own default).
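
The environment-variable notes above can be sketched as a small helper that sets the paired flags together and never touches the thread-count variables. This is an illustrative sketch, not the skill's launch code; `build_serve_env` and `check_env` are hypothetical names, and only the variable names themselves come from the notes.

```python
import os
from typing import Dict, Mapping, Optional


def build_serve_env(
    kv_cache_gib: int,
    freezing: bool = True,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Build the environment for the serve process from `base` (or os.environ)."""
    env = dict(os.environ if base is None else base)
    env["VLLM_CPU_KVCACHE_SPACE"] = str(kv_cache_gib)
    if freezing:
        # The two flags are set as a pair, per the note above.
        env["TORCHINDUCTOR_FREEZING"] = "1"
        env["VLLM_USE_AOT_COMPILE"] = "0"
    # OMP_NUM_THREADS and VLLM_CPU_NUM_OF_RESERVED_CPU are deliberately
    # not set here, so vLLM can derive them.
    return env


def check_env(env: Mapping[str, str]) -> None:
    """Raise if freezing is enabled without AOT compile being disabled."""
    if env.get("TORCHINDUCTOR_FREEZING") == "1" and env.get("VLLM_USE_AOT_COMPILE") != "0":
        raise ValueError("TORCHINDUCTOR_FREEZING=1 requires VLLM_USE_AOT_COMPILE=0")
```

Passing an explicit `base` mapping keeps the helper testable and makes it obvious which variables the launcher adds on top of the inherited environment.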
Testing
- `check.sh`: passes (0 errors).
- estimate → cpu_tune → confirm, plus the guardrails. Live launch/serve is the manual / integration tier on a real EPYC host.
Change-Id: I1dc2362e0983326658b6618015a161ecd44f40e6